refactor: remove redundant partitioned_by_file_group file scan field#23189
Open
Phoenix500526 wants to merge 2 commits into
Open
refactor: remove redundant partitioned_by_file_group file scan field#23189Phoenix500526 wants to merge 2 commits into
Phoenix500526 wants to merge 2 commits into
Conversation
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
`FileScanConfig` had two overlapping ways to declare file scan output partitioning: the `partitioned_by_file_group` bool and `output_partitioning`. Collapse them onto `output_partitioning` as the single source of truth. - Remove the `partitioned_by_file_group` field, the builder field, and the `with_partitioned_by_file_group` builder method. - `ListingTable` now derives the partition-column `Partitioning::Hash` once its file groups are finalized and passes it via `with_output_partitioning`; `hash_partitioning_from_partition_fields` is made `pub` for this. - proto already round-trips `output_partitioning`, so the now-vestigial wire bool is left unset on write and ignored on read (the proto field is kept for backward compatibility). Closes apache#23099. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
After collapsing `partitioned_by_file_group` onto `output_partitioning`, the declared Hash partitioning is now stored on the scan and therefore rendered by `DataSourceExec`'s Display. Update the affected sqllogictest expected plans accordingly. Behavior is unchanged; only the EXPLAIN text gains an `output_partitioning=Hash(...)` entry on partition-grouped scans. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
e9ed249 to
30a0d3c
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue does this PR close?
Rationale for this change
FileScanConfighad two overlapping ways to declare a file scan's outputpartitioning:
partitioned_by_file_group: bool— a shorthand meaning "the file groups areorganized by Hive partition column values, so the output is Hash-partitioned on
those columns", and
output_partitioning: Option<Partitioning>— a general, explicit declaredpartitioning (added in Add
ListingOptions::output_partitioningandFileScanConfig::output_partitioningfor pre-defined file partitioning #22657).The bool is just a lazy shorthand for one specific
output_partitioningvalue(
Partitioning::Hashover the partition columns), and every place that consumedit (
output_partitioning(),repartitioned(),create_sibling_state()) alreadychecked
output_partitioning.is_some() || partitioned_by_file_group. Keeping bothis redundant and the
ListingTablebuilder ended up setting both. This PR makesoutput_partitioningthe single source of truth.What changes are included in this PR?
Following the issue's first option ("Remove
partitioned_by_file_group"):FileScanConfig::partitioned_by_file_group, the correspondingFileScanConfigBuilderfield, and thewith_partitioned_by_file_groupbuildermethod.
ListingTable::scannow derives the partition-columnPartitioning::Hashitself (once its file groups are finalized, so the partition count is correct)
and passes it through the existing
with_output_partitioning. The previouswith_output_partitioning(declared)+with_partitioned_by_file_group(...)double-set is collapsed into one branch.
hash_partitioning_from_partition_fieldsis madepubsoListingTable(a separate crate) can reuse the derivation instead of duplicating the
column-index resolution.
output_partitioning, so no behavior is lost: thenow-vestigial
partitioned_by_file_groupwire field is left unset on write andignored on read. The field is kept in the
.protodefinition for backwardcompatibility.
output_partitioning()/create_sibling_state()/repartitioned()now keysolely off
output_partitioning.Are these changes tested?
Yes — by existing tests, updated to the new single-field model:
datafusion-datasource:test_output_partitioning_with_partition_columns,test_output_partitioning_no_partition_columns,test_declared_output_partitioning_projects_with_scan, and thefile_streamwork-stealing test
morsel_partitioned_by_file_group_keeps_files_local(whichverifies that a declared output partitioning keeps each stream's files local).
datafusion-proto:roundtrip_parquet_exec_output_partitioning(and the otherroundtrip_parquet_exec_*cases) cover the partitioning round-trip. The oldroundtrip_parquet_exec_partitioned_by_file_grouptest exercised the removedAPI and is dropped, as its coverage is subsumed by the
output_partitioninground-trip test.
All of the above pass, along with
cargo fmt --all --checkandcargo clippy --all-targets --all-features -- -D warningsfor the affectedcrates.
Are there any user-facing changes?
Yes — public API changes :
FileScanConfig::partitioned_by_file_groupfield and theFileScanConfigBuilder::with_partitioned_by_file_groupmethod. Callers shouldset
with_output_partitioning(Some(Partitioning::Hash(..)))instead (or use thenow-public
hash_partitioning_from_partition_fieldshelper).hash_partitioning_from_partition_fieldsis nowpub.Query results, optimizer decisions (e.g. eliding
RepartitionExec), and theserialized (proto) wire format are unchanged. There is one display-only
change: EXPLAIN now renders
output_partitioning=Hash(...)onDataSourceExecfor partition-grouped scans. The scan already produced that partitioning before
(it was derived lazily inside
output_partitioning()); it is now stored on theoutput_partitioningfield and therefore shown. Therepartition_subset_satisfactionandpreserve_file_partitioningslt expectedplans are updated accordingly.
cargo-semver-checkswill flag the removals as breaking, which is expected forthis cleanup.